Abstract. Recently Kubica et al. (Inf. Process. Lett., 2013) and Kim et
al. (submitted to Theor. Comput. Sci.) introduced order-preserving pattern
matching. In this problem we are looking for consecutive substrings of
the text that have the same "shape" as a given pattern. These results
include a linear-time order-preserving pattern matching algorithm for
a polynomially-bounded alphabet and an extension of this result to
pattern matching with multiple patterns. We make one step forward in the
analysis and give an $O(n \log n / \log\log n)$ time randomized algorithm
constructing suffix trees in the order-preserving setting. We show a
number of applications of order-preserving suffix trees to identify
patterns and repetitions in time series.



1 Introduction 

We introduce order-preserving suffix trees that can be used for pattern 
matching and repetition discovery problems in the order-preserving set- 
ting, in particular, to model finding trends in time series which appear 
naturally when considering e.g. the stock market or melody matching of
two musical scores. 

Two strings x, y of the same length over an integer alphabet are called
order-isomorphic (or simply isomorphic), written $x \approx y$, if

$$\forall_{1 \le i,j \le |x|} \quad x[i] \le x[j] \Leftrightarrow y[i] \le y[j].$$

Example 1. $(5, 2, 7, 5, 1, 4, 9, 4, 5) \approx (6, 4, 7, 6, 3, 5, 8, 5, 6)$, see Fig. 1.

The notion of order-isomorphism was introduced in [10] and [12]. Both
papers independently study the problem of identifying all consecutive
substrings of a string x that are order-isomorphic to a given string y,
the so-called order-preserving pattern matching problem. If $|x| = n$ and
$|y| = m$, an $O(n + m \log m)$ time algorithm for this problem is presented in
both papers. Moreover, [10] presents extensions of this problem to
multiple-pattern matching based on the algorithm of Aho and Corasick.

The problem of order-preserving pattern matching has evolved from 
the combinatorial study of patterns in permutations. This field of study 
is concentrated on pattern avoidance, that is, counting the number of 
permutations not containing a subsequence which is order-isomorphic to 
a given pattern. Note that in this problem the subsequences need not
be consecutive. The first results on this topic were given by Knuth
[11] (avoidance of 312), Lovász [14] (avoidance of 213) and Rotem [16]
(avoidance of both 231 and 312). On the algorithmic side, pattern matching
in permutations (as a subsequence) was shown to be NP-complete [3] and
a number of polynomial-time algorithms for special cases of patterns were 
developed [1, 5, 7, 9].

Structure of the paper. In Section 3 we give a formal definition of an
order-preserving suffix tree and describe its basic properties.

To obtain an efficient algorithm constructing such suffix trees, in
Section 4 we develop an offline character oracle based on orthogonal range
counting. An $O(n \log n / \log\log n)$ time randomized and offline algorithm
constructing order-preserving suffix trees is obtained. It is based on a
general framework of Cole and Hariharan [6] (or, alternatively, on the
approach of Lee, Na and Park [13]).

Finally, in Section 6 we present a number of applications of
order-preserving suffix trees that generalize the results from [10, 12].
These applications are based on classical applications of suffix trees;
however, a new combinatorial insight is required to adapt the known tools
to the order-preserving setting. In particular, we consider
order-preserving string matching and the problem of detecting the simplest
order-preserving repetitions, which we call op-squares.

2 Preliminaries 

Let $w$ be a string of length $n$ over an integer alphabet $\Sigma$, $w = w_1 \ldots w_n$.
We assume that $\Sigma$ is polynomially bounded in terms of $n$. By $w[i..j]$ we
denote the substring $w_i \ldots w_j$. Denote by $suf_i$ the $i$-th suffix of $w$, that is,
$w[i..n]$. For any $i \in \{1, \ldots, n\}$ define:

$$prev_<(w,i) = |\{k : k < i,\ w_k < w_i\}|,$$
$$prev_=(w,i) = |\{k : k < i,\ w_k = w_i\}|.$$

We introduce codes of single positions and codes of strings as follows:

$$\phi(w,i) = (prev_<(w,i),\ prev_=(w,i)),$$
$$code(w) = (\phi(w,1), \phi(w,2), \ldots, \phi(w,n)).$$

For a string $w$, define $shape(w)$ as the lexicographically smallest string $u$
over $\{0, 1, \ldots\}$ such that $u \approx w$.
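
As an illustration of these definitions, the following minimal Python sketch computes the code and the shape of a string naively in quadratic time. The function names `code` and `shape` simply mirror the notation of the text and are not taken from any library; lists are 0-indexed, so position i of the text corresponds to index i - 1. The shape is computed by rank compression, which is assumed here to coincide with the lexicographically smallest order-isomorphic string.

```python
def code(w):
    """Return code(w) as a list of pairs (prev_<, prev_=), computed naively."""
    result = []
    for i, c in enumerate(w):
        smaller = sum(1 for k in range(i) if w[k] < c)
        equal = sum(1 for k in range(i) if w[k] == c)
        result.append((smaller, equal))
    return result


def shape(w):
    """Return shape(w): every character replaced by its rank among distinct values."""
    rank = {value: r for r, value in enumerate(sorted(set(w)))}
    return [rank[c] for c in w]


x = [5, 2, 7, 5, 1, 4, 9, 4, 5]
y = [6, 4, 7, 6, 3, 5, 8, 5, 6]
print(code(x) == code(y))   # True
print(shape(x))             # [3, 1, 4, 3, 0, 2, 5, 2, 3]
```

The two strings are those of Example 1; equal codes (and equal shapes) confirm that they are order-isomorphic.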




Fig. 1. Example of two order-isomorphic strings. Their codes are equal to
(0,0) (0,0) (2,0) (1,1) (0,0) (2,0) (6,0) (2,1) (4,2) and their shapes are equal to
(3,1,4,3,0,2,5,2,3).



Observation 1 The code has an online property: the code of the i-th
character does not depend on characters in positions to the right of i: if
$code(x) = code(y)$ then $code(x)$ is a prefix of $code(yz)$. (Note that the
function shape does not have this property.)

The following obvious fact is useful in the proof of the forthcoming lemma. 



Observation 2 $x \approx y \Leftrightarrow shape(x) = shape(y)$.

Lemma 1. $x \approx y \Leftrightarrow code(x) = code(y)$.

Proof. The ($\Rightarrow$) part of the equivalence follows from the definition of
a code. As for the ($\Leftarrow$) part, we show an algorithm that reconstructs
$shape(x)$ from $code(x)$. Thus $code(x) = code(y)$ implies that $shape(x) =
shape(y)$ which, in turn, implies $x \approx y$.

The algorithm is as follows. Find the rightmost (0,0) in $code(x)$ (it
exists, since $code(x)$ starts with a (0,0)). Find all elements of the form
(0, z) to the right of this (0,0). All these elements together with this (0,0)
are equal and they correspond to 0s in $shape(x)$. Remove all these
elements and decrease the first coordinate of every other element in $code(x)$
by the number of removed elements that were to its left (only earlier
positions are counted in $prev_<$), and repeat the
process from the beginning, identifying all 1s, 2s, etc. in $shape(x)$. □
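
The reconstruction used in this proof can be sketched in Python as follows. This is a direct, quadratic-time transcription meant only to make the steps concrete; the function name `shape_from_code` is hypothetical. The first coordinate of each surviving element is decreased by the number of removed elements located to its left, since only earlier positions contribute to $prev_<$.

```python
def shape_from_code(codes):
    """Rebuild shape(x) from code(x), given as a list of pairs (prev_<, prev_=)."""
    n = len(codes)
    first = [a for a, _ in codes]   # first coordinates, updated after each removal
    alive = [True] * n
    shape = [None] * n
    value = 0
    while any(alive):
        # Rightmost remaining element whose current code is (0, 0):
        # the first occurrence of the smallest remaining value.
        start = max(i for i in range(n)
                    if alive[i] and first[i] == 0 and codes[i][1] == 0)
        # It and every remaining (0, z) to its right carry the same value.
        group = [i for i in range(start, n) if alive[i] and first[i] == 0]
        for i in group:
            shape[i] = value
            alive[i] = False
        # Each survivor loses one smaller predecessor per removed element on its left.
        members = set(group)
        removed_on_left = 0
        for i in range(n):
            if i in members:
                removed_on_left += 1
            elif alive[i]:
                first[i] -= removed_on_left
        value += 1
    return shape


codes = [(0, 0), (0, 0), (2, 0), (1, 1), (0, 0), (2, 0), (6, 0), (2, 1), (4, 2)]
print(shape_from_code(codes))   # [3, 1, 4, 3, 0, 2, 5, 2, 3]
```

The input is the code of the strings from Example 1, and the output agrees with their common shape.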



3 Order-Preserving Suffix Trees 

Let us define the following family of strings:

$$SufCodes(w) = \{code(suf_1)\#,\ code(suf_2)\#,\ \ldots,\ code(suf_n)\#\},$$

see Fig. 2. The order-preserving suffix tree of $w$, denoted $opSufTree(w)$, is
a compacted trie of all the sequences in $SufCodes(w)$. The $opSufTree(w)$
contains $O(n)$ leaves, hence its size is $O(n)$.



suffixes of w:             SufCodes(w):

6 8 2 0 7 9 3 1 4 5        0 1 0 0 3 5 2 1 4 5 #
8 2 0 7 9 3 1 4 5          0 0 0 2 4 2 1 4 5 #
2 0 7 9 3 1 4 5            0 0 2 3 2 1 4 5 #
0 7 9 3 1 4 5              0 1 2 1 1 3 4 #
7 9 3 1 4 5                0 1 0 0 2 3 #
9 3 1 4 5                  0 0 0 2 3 #
3 1 4 5                    0 0 2 3 #
1 4 5                      0 1 2 #
4 5                        0 1 #
5                          0 #

Fig. 2. SufCodes(w) for w = (6, 8, 2, 0, 7, 9, 3, 1, 4, 5). In this example all the characters
of the string w are distinct, hence for each i we have $prev_=(w, i) = 0$ and we can ignore
the second components of $\phi$. It suffices to take the first component of the code.
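
The family SufCodes(w) and the uncompacted trie built from it can be reproduced with a short Python sketch. It is a naive illustration only (quadratic-size trie stored as nested dictionaries), not the construction discussed later in the paper; the helper names `suf_codes`, `build_trie` and `count_leaves` are hypothetical.

```python
def code(w):
    """Naive code(w): list of pairs (prev_<, prev_=)."""
    return [(sum(1 for k in range(i) if w[k] < w[i]),
             sum(1 for k in range(i) if w[k] == w[i]))
            for i in range(len(w))]


def suf_codes(w):
    """Codes of all suffixes of w, each followed by the terminator '#'."""
    return [code(w[i:]) + ["#"] for i in range(len(w))]


def build_trie(strings):
    """Uncompacted trie as nested dictionaries keyed by code symbols."""
    root = {}
    for s in strings:
        node = root
        for symbol in s:
            node = node.setdefault(symbol, {})
    return root


def count_leaves(node):
    """Number of leaves below a trie node (a node without children is a leaf)."""
    return 1 if not node else sum(count_leaves(child) for child in node.values())


w = [6, 8, 2, 0, 7, 9, 3, 1, 4, 5]
for s in suf_codes(w):
    # All characters of w are distinct, so only first components are printed.
    print(" ".join(str(sym[0]) if sym != "#" else "#" for sym in s))
print(count_leaves(build_trie(suf_codes(w))))   # 10
```

Each suffix contributes exactly one leaf, since no element of SufCodes(w) is a prefix of another.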
As usual, only the explicit nodes (that is, branching nodes and leaves)
of opSufTree(w) are stored. The leaves store starting positions of the 
corresponding suffixes. Each branching node stores its depth and one of 
the leaves in its subtree. Each inner node stores a suffix link that may 
lead to an implicit or an explicit node. 

Each edge stores the code only of its first character. The codes of
all the remaining characters of any edge can be obtained using the
so-called character oracle that can efficiently provide the code $\phi(suf_i, j)$ for
any $i$ and $j$ (a description of the character oracle construction is given in
Section 4).

Example 2. Consider the order-preserving suffix tree of the string

w = (6, 8, 2, 0, 7, 9, 3, 1, 4, 5),
see Fig. 3. All SufCodes(w) are given in Fig. 2.




Fig. 3. The uncompacted trie of SufCodes(w) for w = (6,8,2,0,7,9,3,1,4,5) (to 
the left) and its compacted version which together with character oracle forms 
opSufTree(w) (to the right). 



4 Character Oracle 

We use a geometric approach: the computation of $\phi$ for $w$ corresponds to
counting points in certain orthogonal rectangles in the plane.

Observation 3 Let us treat the pairs $(i, w_i)$ as points in the plane. Then
$\phi(suf_j, i) = (a, b)$, where $a$ is the number of points that lie within the
rectangle $A = [j, i-1] \times (-\infty, w_i)$ and $b$ is the number of points in the
rectangle $B = [j, i-1] \times [w_i, w_i]$, see Fig. 4.






Fig. 4. Geometric illustration of the sequence w = (5, 4, 7, 5, 8, 6, 1, 5, 6). The elements
$w_i$ are represented as points $(i, w_i)$. The computation of $\phi(suf_2, 8) = (2, 1)$ corresponds
to counting points in rectangles A, B.
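
The geometric view of Observation 3 can be checked with a brute-force Python sketch that counts the points of both rectangles directly. The function name `phi_suffix` is hypothetical; positions j and i are 1-based positions in w, as in the figure.

```python
def phi_suffix(w, j, i):
    """phi(suf_j, i) for 1-based positions j <= i of w, by counting points (k, w_k)."""
    wi = w[i - 1]
    points = [(k, w[k - 1]) for k in range(1, len(w) + 1)]
    # Rectangle A = [j, i-1] x (-inf, w_i): earlier points strictly below w_i.
    a = sum(1 for (k, v) in points if j <= k <= i - 1 and v < wi)
    # Rectangle B = [j, i-1] x [w_i, w_i]: earlier points at the same height.
    b = sum(1 for (k, v) in points if j <= k <= i - 1 and v == wi)
    return (a, b)


w = [5, 4, 7, 5, 8, 6, 1, 5, 6]
print(phi_suffix(w, 2, 8))   # (2, 1)
```

Each call takes linear time here; the point of the oracle described next is to answer the same query much faster after preprocessing.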



The orthogonal range counting problem is defined as follows. We are given 
n points in a plane and we need to answer queries of the form: 

"how many points are contained in a given axis-aligned rectangle?" . 
An efficient solution to this problem was given by Chan and Pătrașcu,
see Theorem 2.3 in [1], which we state below as Lemma 2. We say that a
point $(p, q)$ dominates a point $(p', q')$ if $p > p'$ and $q > q'$.

Lemma 2. We can preprocess n points in the plane in $O(n\sqrt{\log n})$ time,
using a data structure with $O(n)$ words of space, so that we can count the
number of points dominated by a query point in $O(\log n / \log\log n)$ time.

One can easily observe that the offline orthogonal range counting can be 
reduced to the dominance problem described in Lemma 2. We use the
solution from this lemma to build our character oracle. 
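
The reduction from rectangle counting to dominance counting is a four-corner inclusion-exclusion. The Python sketch below illustrates it with a simple offline sweep and a Fenwick tree, which answers each query in O(log n) time; this is a textbook substitute chosen for brevity, not the data structure of Lemma 2. Integer coordinates are assumed, dominance is taken in the non-strict form (px <= x and py <= y), and all function names are hypothetical.

```python
from bisect import bisect_right


def offline_dominance_counts(points, queries):
    """For each query (x, y), count points (px, py) with px <= x and py <= y."""
    ys = sorted({py for _, py in points})
    tree = [0] * (len(ys) + 1)          # Fenwick tree over 1-based y-ranks

    def add(pos):
        while pos <= len(ys):
            tree[pos] += 1
            pos += pos & -pos

    def prefix(pos):
        total = 0
        while pos > 0:
            total += tree[pos]
            pos -= pos & -pos
        return total

    answers = [0] * len(queries)
    order = sorted(range(len(queries)), key=lambda q: queries[q][0])
    pts = sorted(points)
    nxt = 0
    for q in order:                     # sweep queries by increasing x
        x, y = queries[q]
        while nxt < len(pts) and pts[nxt][0] <= x:
            add(bisect_right(ys, pts[nxt][1]))
            nxt += 1
        answers[q] = prefix(bisect_right(ys, y))
    return answers


def rectangle_counts(points, rectangles):
    """Count points in closed integer rectangles (x1, x2, y1, y2)."""
    queries = []
    for x1, x2, y1, y2 in rectangles:
        queries += [(x2, y2), (x1 - 1, y2), (x2, y1 - 1), (x1 - 1, y1 - 1)]
    d = offline_dominance_counts(points, queries)
    return [d[4 * r] - d[4 * r + 1] - d[4 * r + 2] + d[4 * r + 3]
            for r in range(len(rectangles))]


w = [5, 4, 7, 5, 8, 6, 1, 5, 6]
points = [(k + 1, v) for k, v in enumerate(w)]
# Rectangles A and B of Fig. 4 for phi(suf_2, 8); values of w are at least 1.
print(rectangle_counts(points, [(2, 7, 0, 4), (2, 7, 5, 5)]))   # [2, 1]
```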

Lemma 3. Let $w$ be a string of length $n$ and let $suf_1, \ldots, suf_n$ be its
suffixes. After $O(n\sqrt{\log n})$ time and $O(n)$ space preprocessing one can
compute $\phi(suf_j, i)$ for any $i, j$ in $O(\log n / \log\log n)$ time.

Proof. Due to Observation 3 our problem can be reduced to an orthogonal
range counting problem. Using Lemma 2 we obtain a solution to this
problem with the requested preprocessing and query time and space. □



5 Construction of Order-Preserving Suffix Trees 

We use the tools introduced by Cole and Hariharan [6] for the construction 
of suffix trees for quasi-suffix collections of strings. 



5.1 Quasi-suffix Collections

A family of strings $S_1, \ldots, S_n$ is called a quasi-suffix collection [6] if the
following conditions hold:

1. $|S_1| = n$ and $|S_i| = |S_{i-1}| - 1$ for all $i > 1$.

2. No $S_i$ is a prefix of another $S_j$.

3. If $S_i$ and $S_j$ have a common prefix of length $l > 0$ then $S_{i+1}$ and $S_{j+1}$
have a common prefix of length at least $l - 1$.

The suffix tree for a quasi-suffix collection is defined as a compacted trie 
of all the strings in the collection. 

Lemma 4. Let w be a string of length n. Then the strings in SufCodes(w) 
form a quasi-suffix collection. 

Proof. Conditions 1 and 2 of a quasi-suffix collection obviously hold.
Condition 3 is a direct consequence of the following common prefix
property.

Claim 1 If $code(ax) = code(by)$ then $code(x) = code(y)$.

Proof. Due to Lemma 1, $code(ax) = code(by)$ implies that $ax \approx by$.
Hence, obviously $x \approx y$. Again due to Lemma 1 we have $code(x) =
code(y)$. □

Consequently, $SufCodes(w)$ satisfies all conditions for a quasi-suffix
collection. □
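
The three conditions can be verified by brute force on small inputs. The following Python sketch is only a sanity check of Lemma 4, not part of any construction; the helper names are hypothetical. The terminator '#' is counted as a symbol, so the sketch checks that consecutive lengths decrease by one but does not check the absolute length of the first string.

```python
def code(w):
    """Naive code(w): list of pairs (prev_<, prev_=)."""
    return [(sum(1 for k in range(i) if w[k] < w[i]),
             sum(1 for k in range(i) if w[k] == w[i]))
            for i in range(len(w))]


def suf_codes(w):
    """Codes of all suffixes of w, each followed by the terminator '#'."""
    return [code(w[i:]) + ["#"] for i in range(len(w))]


def common_prefix_length(s, t):
    length = 0
    while length < min(len(s), len(t)) and s[length] == t[length]:
        length += 1
    return length


def is_quasi_suffix_collection(strings):
    n = len(strings)
    # Condition 1 (relative part): each string is one symbol shorter than the previous.
    if any(len(strings[i]) != len(strings[i - 1]) - 1 for i in range(1, n)):
        return False
    for i in range(n):
        for j in range(n):
            if i == j:
                continue
            l = common_prefix_length(strings[i], strings[j])
            # Condition 2: no string is a prefix of another one.
            if l == len(strings[i]):
                return False
            # Condition 3: the next two strings share a prefix of length >= l - 1.
            if l > 0 and i + 1 < n and j + 1 < n:
                if common_prefix_length(strings[i + 1], strings[j + 1]) < l - 1:
                    return False
    return True


print(is_quasi_suffix_collection(suf_codes([6, 8, 2, 0, 7, 9, 3, 1, 4, 5])))   # True
```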



5.2 Order-Preserving Suffix Tree Construction

Cole and Hariharan [6] provided a general framework for constructing
suffix trees for quasi-suffix collections $(S_i)$. Given a
character oracle that provides the $j$-th character of any $S_i$ in $O(1)$ time,
their algorithm constructs the suffix tree for a quasi-suffix
collection in $O(n)$ time and space with almost inverse exponential
failure probability. This result assumes that the $S_i$ are over an alphabet
of size polynomial in $n$.

We apply this result to obtain an order-preserving suffix tree by using 
the character oracle that we developed in Section 4.

Theorem 1. The order-preserving suffix tree of a string of length n can
be constructed in $O(n \log n / \log\log n)$ randomized time.

Proof. Due to Lemma 4 we can apply the algorithm of Cole and Hariharan [6].
We use the character oracle from Lemma 3. The algorithm of Cole and Hariharan
[6] calls the oracle $O(n)$ times, and each call takes $O(\log n / \log\log n)$
time, which gives $O(n \log n / \log\log n)$ total construction time; the
$O(n\sqrt{\log n})$ preprocessing of the oracle is dominated by this bound. □



The framework of Cole and Hariharan [6] is based on McCreight's
algorithm for suffix tree construction [15], which is an offline algorithm.
Recently Lee, Na and Park [13] presented a modified version of the algorithm
from [6] that uses Ukkonen's suffix tree construction algorithm [18], which
works online (the characters of the string can be given one at a time). The
construction of Lee, Na and Park is designed only for parameterized
suffix trees; however, it also works in the general quasi-suffix setting, hence,
in particular, in the order-preserving setting. Using this construction, an
alternative $O(n \log n / \log\log n)$ time algorithm for order-preserving suffix trees can
be obtained. Unfortunately, our oracle does not work online and thus the
resulting algorithm is still offline.

6 Applications of Order-Preserving Suffix Trees 

The most common application of suffix trees is pattern matching with 
time complexity independent of the length of the text. With the aid of 
order-preserving suffix trees we obtain a similar result with an additional 
small factor in the time complexity which is due to the character oracle. 
This result is possible due to the "suffix-independence" property of our 
coding function, see Observation 1.

Theorem 2. Assume that we have $opSufTree(w)$ of a string $w$ of length
$n$. Given a pattern $x$ of length $m$, one can check if $x$ is a factor of $w$
in $O(m \log n / \log\log n)$ time and report all occurrences in
$O(m \log n / \log\log n + Occ)$ time,
where Occ is the number of occurrences.
Proof. First we build the character oracle for the pattern; this takes
$O(m\sqrt{\log m}) = O(m \log n / \log\log n)$ time. To answer a query, we traverse down
the edges of the suffix tree using the character oracles for the pattern and
the text. If we are at a branching node of depth $h$, we check if there is
an outgoing edge starting with $\phi(x[1..h+1], x[h+1])$. Otherwise we are at an
implicit node of depth $h$ located on an edge leading to an explicit node
that has some leaf $i$ in its subtree. In this case we check if $\phi(x[1..h+1], x[h+1])$
equals $\phi(w[i..i+h], w[i+h])$.

This allows us to find the locus of $x$ in $O(m \log n / \log\log n)$ time. Afterwards all
the occurrences of $x$ can be found in the usual way by traversing all nodes
in the corresponding subtree. □

A string $uv$ is called an order-preserving square (an op-square, in short) if
$u \approx v$. The length of the op-square is defined as $|uv|$. Thus an op-square
represents a repeating pattern in a time series. Using order-preserving
suffix trees we can obtain algorithms for finding and reporting op-squares.

Note that each string of length at least 2 contains an op-square of
length 2. Hence, no such string is op-square-free. We show how to modify
the square-detecting algorithm by Gusfield and Stoye [17] to check, for
each length $k$, if a given string $w$ contains an op-square of length $k$.

Branching squares. We say that a substring $w[i..i+2k-1]$ is a
branching square if $w[i..i+k-1] = w[i+k..i+2k-1]$ and $w[i+2k] \ne w[i]$.
The algorithm from [17] uses the suffix tree of a text $w$, $|w| = n$, to find all
branching squares in $w$ in $O(n \log n)$ time. Each such square is detected
when inspecting the edges outgoing from the explicit node corresponding
to the string $w[i..i+k-1]$.

Non-extendible and non-shiftable op-squares. We say that an
op-square $w[i..i+2k-1]$ is non-extendible if

$$w[i..i+k-1] \approx w[i+k..i+2k-1]$$

and

$$w[i..i+k] \not\approx w[i+k..i+2k].$$

A non-shiftable op-square is defined similarly but with the last condition
substituted by

$$w[i+1..i+k] \not\approx w[i+k+1..i+2k].$$

When we apply the algorithm from [17] to the order-preserving suffix tree,
we find all non-extendible op-squares. It suffices to prove the following
property.
Lemma 5. If $w$ contains an op-square of a given length then it contains
a non-extendible op-square of the same length.

Proof. Let $w[i..i+2k-1]$ be the rightmost op-square of length $2k$ in $w$.
Then it is a non-shiftable op-square:

$$w[i+1..i+k] \not\approx w[i+k+1..i+2k].$$

Hence, by Claim 1,

$$w[i..i+k] \not\approx w[i+k..i+2k]$$

and consequently $w[i..i+2k-1]$ is a non-extendible op-square. □

Consequently we obtain an efficient algorithm for detecting an op-square
of a given length. Note that the algorithm does not need to query the
character oracle; it only processes the skeleton of the suffix tree.

Theorem 3. For a string w of length n, after $O(n \log n)$ time preprocessing
one can check if w contains an op-square of a given length in $O(1)$
time.

The algorithm of Gusfield and Stoye can also compute all the
occurrences of squares in a string in additional time proportional to the
number of reported occurrences. For this, it starts at every branching
square $w[i..i+2k-1]$ and shifts it to the left position-by-position as
long as it forms a square, i.e. as long as $w[i-j] = w[i+k-j]$, $j = 1, 2, \ldots$

A generalization of this algorithm to op-squares requires efficient
testing of whether an op-square can be shifted to the left. This could be
done using the character oracle for the reversed text; however, there is a
more efficient solution.

Theorem 4. All order-preserving squares in a string w of length n can
be computed in $O(n \log n + Occ)$ time, where Occ is the total number of
occurrences of op-squares.

Proof. We use the following fact. 

Claim 2 The string $w[i..i+2k-1]$ is an op-square if and only if the
lowest common ancestor (LCA) node of the leaves of $opSufTree(w)$
corresponding to $suf_i$ and $suf_{i+k}$ has depth at least $k$.

After $O(n)$ preprocessing time, LCA of nodes in a tree can be
computed in $O(1)$ time [8]. By the claim we can keep shifting the
non-extendible op-square to the left. We stop either when the tested
substring is not an op-square or when we encounter another non-extendible
op-square; the latter situation is possible since non-extendible op-squares
can still be shiftable. We obtain an algorithm with the required complexity.

□ 
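
The criterion of Claim 2 can be illustrated without building the tree: the depth of the LCA of two leaves equals the length of the longest common prefix of the corresponding code sequences. The Python sketch below replaces the LCA query by a direct prefix comparison, so it is a naive cubic-time check rather than the algorithm of Theorem 4; the function name `op_squares` is hypothetical.

```python
def code(w):
    """Naive code(w): list of pairs (prev_<, prev_=)."""
    return [(sum(1 for k in range(i) if w[k] < w[i]),
             sum(1 for k in range(i) if w[k] == w[i]))
            for i in range(len(w))]


def op_squares(w):
    """All pairs (i, k), with 1-based i, such that w[i..i+2k-1] is an op-square."""
    n = len(w)
    suffix_codes = [code(w[i:]) for i in range(n)]
    result = []
    for i in range(n):
        for k in range(1, (n - i) // 2 + 1):
            # Claim 2: codes of suf_i and suf_{i+k} agree on their first k symbols.
            if suffix_codes[i][:k] == suffix_codes[i + k][:k]:
                result.append((i + 1, k))
    return result


print(op_squares([1, 3, 2, 4]))   # [(1, 1), (1, 2), (2, 1), (3, 1)]
```

In the example, (1, 2) corresponds to the op-square (1, 3, 2, 4), whose halves (1, 3) and (2, 4) are both increasing; the remaining pairs are the trivial op-squares of length 2.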